How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Neural Information Processing Systems

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$. This is no slower than the best known stochastic version of Newton's method in all parameter regimes.


Reviews: How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Neural Information Processing Systems

This work studies convergence rates of the gradients for convex composite objectives by combining Nesterov's tricks used for gradient descent with SGD. The authors provide three approaches which differ from each other only slightly, and they provide the convergence rates for all the proposed approaches. My comments on this work are as follows: 1. It is indeed important to study convergence rates of gradients, especially for non-convex problems. The authors motivate the readers by mentioning this, but they assume convexity in their problem set-up.


How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Allen-Zhu, Zeyuan

Neural Information Processing Systems

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$. This is no slower than the best known stochastic version of Newton's method in all parameter regimes.
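
For readers unfamiliar with the quantity being bounded, the following is a minimal sketch of plain SGD, not SGD3 or SGD5, whose details are not given in this abstract. It runs on a hypothetical one-dimensional convex objective $f(x) = x^2/2$ with Gaussian gradient noise and reports the true gradient norm $|f'(x)|$ at the final iterate. The objective, the $1/\sqrt{t}$ step-size schedule, and the function name are all illustrative assumptions.

```python
import random


def sgd_final_gradient_norm(steps, lr=0.5, noise=1.0, x0=5.0, seed=0):
    """Plain SGD on the toy objective f(x) = x**2 / 2 (an assumed example).

    Returns |f'(x)| = |x| at the last iterate, i.e. the quantity that
    "making the gradient small" refers to.
    """
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        # Unbiased stochastic gradient: true gradient x plus Gaussian noise.
        g = x + rng.gauss(0.0, noise)
        # Decaying step size lr / sqrt(t), a common textbook choice.
        x -= (lr / t ** 0.5) * g
    return abs(x)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, sgd_final_gradient_norm(n))
```

Running the script prints the final gradient norm for increasing iteration budgets. The sketch only illustrates what is measured and makes no claim about the rates stated in the abstract.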